1.  Introduction


Lowest  possible  power  dissipation  is  a  requirement  for  many  Department  of  Defense  (DoD)  and 
commercial circuits. For DoD applications, low power is demanded by man-portable, missile, munition,
and satellite applications. For commercial applications, low power is demanded by portable electronic
computer,  entertainment,  medical,  and  communication  systems.  The  Power  Estimation  and  Synthesis  For 
Low  Power  project  investigated  a  wide  variety  of  circuit  design  techniques,  power  estimation  algorithms, 
design  algorithms,  and  advanced  logic  circuit  types  that  may  be  used  to  implement  low  power  electronic 
systems. 
2.  Average Dynamic Power Estimation

In  this  research  we  developed  techniques  (and  corresponding  software  tools)  to  estimate  power 
dissipation  in  digital  CMOS  circuits.  The  main  contributions  and  achievements  are  listed  below.  Software 
tools  were  developed  to  estimate  power  at  different  levels  of  design  abstraction. 

•  Efficient  and  accurate  techniques  to  estimate  power  in  combinational  circuits 

•  Efficient  and  accurate  techniques  to  estimate  power  in  sequential  circuits 

•  Estimation  of  bounds  on  power  based  on  power  sensitivity 

•  Development and implementation of a power macromodeling technique

2.1  Techniques to Estimate Power in Combinational Circuits

Symbolic  and  statistical  techniques  have  been  developed  to  accurately  estimate  power  dissipation 
considering  simultaneous  switching,  temporal,  and  spatial  signal  correlations. 

The basic idea of the symbolic method is to express the signal probability (the probability of a signal being
logic ONE) and the signal activity (the probability of a signal switching) of each internal node in terms of the
probability and activity of the primary inputs, so that spatial correlation between internal nodes can be handled.
Results show that the power dissipation determined by our technique is on average within 2% of logic
simulation results. Ignoring simultaneous switching can introduce an error of over 21%.
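
The effect of spatial correlation can be illustrated with a minimal sketch. The circuit, function names, and input probabilities below are hypothetical and are not taken from the report's tools. Summing over all primary input assignments is exponential and only practical for tiny examples, and the activity formula assumes that successive input vectors are independent.

```python
from itertools import product

def exact_probability(func, input_probs):
    """Signal probability of func, summed over all primary input assignments."""
    total = 0.0
    for bits in product((0, 1), repeat=len(input_probs)):
        weight = 1.0
        for bit, p in zip(bits, input_probs):
            weight *= p if bit else 1.0 - p
        if func(*bits):
            total += weight
    return total

def reconvergent(a, b, c):
    """Hypothetical circuit with reconvergent fanout: (a AND b) OR (a AND c)."""
    return (a & b) | (a & c)

probs = [0.5, 0.5, 0.5]

# Exact value: the output is expressed directly in terms of primary inputs.
p_exact = exact_probability(reconvergent, probs)

# Gate-by-gate propagation that wrongly treats the two AND outputs as independent.
p_ab = probs[0] * probs[1]
p_ac = probs[0] * probs[2]
p_naive = p_ab + p_ac - p_ab * p_ac

# Switching activity when successive input vectors are independent: 2 p (1 - p).
activity_exact = 2 * p_exact * (1 - p_exact)

print(p_exact, p_naive, activity_exact)
```

The two AND gates share input a, so the independence assumption overestimates the output probability in this example.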

The basic idea of the Monte Carlo based statistical method to estimate power dissipation is to simulate a
circuit with random patterns applied to primary inputs. Such random patterns conform to the given
probabilities and activities of primary inputs. The number of simulations is determined by user-specified
parameters, such as the confidence level and the error that can be tolerated. The statistical technique can
handle different delay models for logic gates so as to include spurious transitions in its analysis. Spurious
transitions can occur because paths with different delays converge on logic gates, and they increase
power dissipation. Results indicate that spurious transitions can account for more than 50% of
power dissipation for some benchmark circuits.
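
The stopping rule can be sketched as follows. This is an illustration under simplifying assumptions, not the report's implementation: the two-gate circuit is hypothetical, a zero-delay model is used (so spurious transitions are not captured), successive input vectors are drawn independently, and the parameter names (rel_error, z, batch) are invented here. Sampling stops once the confidence interval half-width falls below the tolerated relative error.

```python
import math
import random

def node_values(vec):
    """Internal nodes of a hypothetical circuit: n1 = a AND b, n2 = n1 OR c."""
    a, b, c = vec
    n1 = a & b
    n2 = n1 | c
    return (n1, n2)

def toggles(prev_vec, next_vec):
    """Number of node toggles between two input vectors (zero-delay model)."""
    return sum(x != y for x, y in zip(node_values(prev_vec), node_values(next_vec)))

def monte_carlo_activity(input_probs, rel_error=0.05, z=1.96, batch=30,
                         min_samples=20, max_samples=10000, seed=1):
    """Estimate average toggles per cycle; returns (estimate, samples used)."""
    rng = random.Random(seed)

    def draw():
        return tuple(int(rng.random() < p) for p in input_probs)

    samples = []
    while len(samples) < max_samples:
        # One sample is the average toggle count over a short batch of cycles.
        prev, count = draw(), 0
        for _ in range(batch):
            nxt = draw()
            count += toggles(prev, nxt)
            prev = nxt
        samples.append(count / batch)

        n = len(samples)
        if n >= min_samples:
            mean = sum(samples) / n
            var = sum((s - mean) ** 2 for s in samples) / (n - 1)
            half_width = z * math.sqrt(var / n)
            if mean > 0 and half_width <= rel_error * mean:
                return mean, n
    return sum(samples) / len(samples), len(samples)

estimate, used = monte_carlo_activity([0.5, 0.5, 0.5])
print(estimate, used)
```

For this circuit with all input probabilities at 0.5, the exact expected toggle count per cycle is 0.84375, which gives a reference against which the estimate can be compared.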

2.2  Techniques to Estimate Power in Sequential Circuits

Probabilistic  and  statistical  techniques  have  been  implemented  to  estimate  power  dissipation  in 
sequential  circuits.  Due  to  the  feedback  of  inputs  from  the  next  state,  the  estimation  techniques  for 
combinational  and  sequential  circuits  are  quite  different. 

The technique to estimate signal probability and activity works as follows. Given the STG (state
transition  graph)  of  a  sequential  circuit  or  an  FSM  (finite  state  machine),  we  build  an  Extended  State 
Transition  Graph  (ESTG)  and  calculate  the  probability  of  a  state  of  the  ESTG.  The  signal  activities  are 
then  estimated  from  the  ESTG. 

The exact method may require solving a linear system of equations of size 2^(N+M), where N is the
number of primary inputs and M is the number of flip-flops. For large circuits with a large number of
primary inputs and flip-flops, the exact method is not computationally feasible. Therefore, we propose an
approximate method which takes temporal correlations of primary inputs into account. We unroll the
circuit k times to calculate the probabilities and activities of internal nodes. Results indicate that this
technique can have an accuracy of 90% while being several orders of magnitude faster than logic
simulation.
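
To make the notion of state probabilities concrete, the sketch below computes the steady-state probabilities of a small FSM by power iteration on its transition matrix. The 3-state machine, the single input, and the function names are hypothetical. The sketch assumes a temporally independent input and works on a plain STG; it does not construct the ESTG described above.

```python
def next_state(state, x):
    """Hypothetical FSM: counts consecutive ones on input x, saturating at 2."""
    return min(state + 1, 2) if x else 0

def state_probabilities(p_one, num_states=3, iterations=200):
    """Steady-state probabilities when the input is ONE with probability p_one."""
    # T[s][t] is the probability of moving from state s to state t in one cycle.
    T = [[0.0] * num_states for _ in range(num_states)]
    for s in range(num_states):
        for x, px in ((0, 1.0 - p_one), (1, p_one)):
            T[s][next_state(s, x)] += px

    # Power iteration: repeatedly apply the transition matrix to a distribution.
    pi = [1.0 / num_states] * num_states
    for _ in range(iterations):
        pi = [sum(pi[s] * T[s][t] for s in range(num_states))
              for t in range(num_states)]
    return pi

pi = state_probabilities(0.5)
print(pi)
```

For this machine the steady state is (1 - p, p(1 - p), p^2), so the iteration can be checked by hand. Node activities would then be obtained by weighting transitions with these state probabilities.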

Statistical techniques for sequential circuits consider Near-Closed (NC) sets of states. A set of states is
called "near-closed" if, given that the starting state is in the set, the probability of remaining in that set is
high. Techniques to determine the warm-up period and stopping criteria for Monte Carlo based
statistical simulations have been developed for circuits in which NC sets are present. The computation time of state
probability can be reduced by 50% (compared to the standard Monte Carlo technique) by the proposed
method. The relative error of the individual node activity estimated by the Monte Carlo based technique
with a warm-up period is within 3% of the result obtained by long-run logic simulation.

2.3  Estimation  of  Bounds  on  Average  Power 

Power dissipation in CMOS circuits is heavily dependent on the input signal distribution. However, due
to uncertainties in the specification of the input signal distribution, the average power dissipation should be
specified between a minimum and a maximum possible value. Due to the complex nature of the problem,
it  is  practically  impossible  to  use  traditional  power  estimation  techniques  to  determine  the  bounds.  Power 
sensitivity,  defined  as  the  change  in  average  power  due  to  changes  in  the  specification  of  primary  inputs, 
can  be  used  to  accurately  estimate  the  maximum  and  minimum  bounds  for  average  power. 

Both symbolic and statistical techniques have been developed to estimate power sensitivity as a by-product
of average power estimation, thereby leading to efficient implementation. Our results on ISCAS
and  MCNC  benchmark  circuits  indicate  that  for  some  circuits  power  dissipation  can  be  very  sensitive  to 
some  primary  inputs.  A  small  variation  in  signal  distribution  can  cause  power  dissipation  to  change 
drastically.  Results  on  minimum  and  maximum  average  power  show  that  such  bounds  can  vary  widely  if 
the  primary  input  probabilities  and  activities  are  not  specified  accurately. 
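
A first-order version of the bounding idea can be sketched as follows. This is only an illustration: the two-gate activity function is hypothetical, sensitivities are obtained here by finite differences rather than as a by-product of the estimation itself, and the linear bounds are only meaningful for small input uncertainties.

```python
def average_activity(p):
    """Total switching activity of a hypothetical circuit with independent inputs."""
    a, b, c = p
    p1 = a * b                      # n1 = a AND b
    p2 = 1 - (1 - p1) * (1 - c)     # n2 = n1 OR c
    return 2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)

def sensitivities(f, p, h=1e-6):
    """Central finite-difference estimate of df/dp_i for each primary input."""
    grads = []
    for i in range(len(p)):
        up = list(p)
        dn = list(p)
        up[i] += h
        dn[i] -= h
        grads.append((f(up) - f(dn)) / (2 * h))
    return grads

def first_order_bounds(f, p, delta):
    """Bounds on f when each p_i may deviate by at most delta_i (linearized)."""
    nominal = f(p)
    spread = sum(abs(g) * d for g, d in zip(sensitivities(f, p), delta))
    return nominal - spread, nominal + spread

low, high = first_order_bounds(average_activity, [0.5, 0.5, 0.5], [0.05, 0.05, 0.05])
print(low, high)
```

Inputs with large sensitivity magnitude dominate the spread, which matches the observation that power can be very sensitive to a few primary inputs.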

2.4  Power  Macromodeling  Technique 

In  order  to  shorten  design  time  and  to  reduce  design  iterations,  we  have  to  estimate  the  power 
dissipation  at  a  high  level  of  abstraction  to  ensure  that  the  strict  power  requirement  of  a  future  design  is 
satisfied.  One  of  the  main  objectives  of  this  research  is  to  develop  a  power  macromodel  for  a  module  so 
that  power  dissipation  can  be  obtained  under  any  distribution  of  primary  inputs.  When  the  same  module  is 
reused,  we  can  obtain  its  power  simply  by  using  a  look-up  table.  Since  the  power  dissipation  of  a  circuit  is 
strongly  dependent  on  the  statistics  of  primary  inputs,  the  relationship  of  power  versus  primary  input 
probabilities  and  activities  is  a  complicated  surface.  Once  such  a  surface  is  constructed,  power  dissipation 
under  any  distribution  of  primary  inputs  can  be  easily  obtained. 

A  straightforward  way  is  to  approximate  such  a  power  surface  using  a  large  number  of  discrete  points. 
The  more  points  one  chooses,  the  more  accurate  the  result  one  can  obtain.  However,  more  points  directly 
translate  to  longer  CPU  time. 

Power sensitivity can be used to efficiently develop a power macromodel. The power surface can be
approximated by planes, each constructed from a representative point and the power sensitivities at that
point. Results for power dissipation under any distribution of primary inputs demonstrated the accuracy
and efficiency of this technique.
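
One way to write the plane approximation is as a first-order expansion around a representative point. The notation is an assumption introduced for illustration: x collects the primary input probabilities and activities, x_0 is the representative point, and S_i is the power sensitivity with respect to x_i.

```latex
P(\mathbf{x}) \approx P(\mathbf{x}_0) + \sum_{i=1}^{n} S_i \,\bigl(x_i - x_{0,i}\bigr),
\qquad
S_i = \left.\frac{\partial P}{\partial x_i}\right|_{\mathbf{x} = \mathbf{x}_0}
```

Each representative point then costs one power estimate plus its sensitivities, and the plane is only trusted in the neighborhood of that point.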
3.  Leakage Power Estimation and Control

With the scaling down of supply and transistor threshold voltage, the power dissipation due to sub-threshold
leakage can be high. We developed and implemented a software tool to estimate the leakage
power  in  both  stand-by  and  active  mode  of  operation.  We  developed: 

•  Leakage  current  model 

•  Techniques  to  estimate  leakage  power 

•  Techniques to control leakage current in future generation ICs

Low  supply  voltage  requires  the  device  threshold  to  be  reduced  in  order  to  maintain  performance.  Due 
to  the  exponential  relationship  between  leakage  current  and  threshold  voltage  in  the  weak  inversion  region, 
leakage  power  can  no  longer  be  ignored.  We  present  a  technique  to  accurately  estimate  leakage  power  by 
accurately  modeling  the  leakage  current  in  transistor  stacks.  The  standby  leakage  current  model  has  been 
verified  by  HSPICE.  We  demonstrate  the  dependence  of  leakage  power  on  primary  inputs.  Based  on  our 
analysis  we  can  determine  good  bounds  for  leakage  power  in  the  standby  mode.  As  a  by-product  of  this 
analysis,  we  can  also  determine  the  set  of  input  vectors  which  can  put  the  circuits  in  the  low-power 
standby  mode. 

3.1  Leakage Current Model

The  accuracy  of  leakage  power  estimation  is  critically  dependent  on  the  standby  leakage  current  model. 
We  have  developed  a  general  model  of  leakage  current  for  transistors  connected  in  a  stack.  This  model 
considers  the  general  case  of  transistor  stacks  of  arbitrary  height.  It  takes  into  account  both  body  effect 
and  DIBL  (Drain  Induced  Barrier  Lowering).  DIBL  (reduction  of  threshold  voltage  as  VDS  increases)  is 
especially  significant  for  sub-micron  devices.  The  leakage  of  a  transistor  stack  is  shown  to  directly  depend 
on  the  magnitude  of  the  DIBL  effect.  The  standby  leakage  current  model  has  been  verified  by  HSPICE. 

3.2  Techniques  to  Estimate  Leakage  Power 

Because of reverse biasing between gate and source in transistor stacks, DIBL, and the body effect,
the leakage power of a circuit depends on the primary input combination. Hence, the leakage power of a
circuit should be specified between a minimum and a maximum possible value.

A genetic algorithm and a deterministic approach have been developed to effectively search for bounds
on  leakage  power.  Unlike  random  search  techniques,  the  above  approaches  produce  considerably  tighter 
bounds  for  leakage  power  dissipation  in  the  stand-by  mode  of  operation. 

3.3  Leakage  Control  Techniques 

Since  the  leakage  power  of  a  circuit  depends  on  primary  input  combinations,  the  primary  inputs 
corresponding  to  the  minimum  leakage  power  can  be  applied  to  the  circuit  during  standby  mode  to 
minimize the leakage power, thereby leading to a reduction in total power consumption. Results for
minimum  and  maximum  leakage  power  indicate  that  for  some  circuits  leakage  power  can  vary  widely  with 
different  primary  input  combinations.  Applying  the  best  primary  input  combination  to  a  circuit  during 
standby  mode  will  significantly  reduce  the  leakage  power. 
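
The dependence of standby leakage on the input vector can be illustrated with a small exhaustive search. Everything in this sketch is hypothetical: the per-state leakage values are made up to mimic the stack effect and are not HSPICE data, and the circuit has only two gates. Exhaustive enumeration is only feasible for a handful of inputs, which is why larger circuits call for search techniques such as those described in Section 3.2.

```python
from itertools import product

# Hypothetical standby leakage (nA) of a 2-input NAND for each input state.
# Illustrative values only: with both series NMOS devices off, leakage is lowest.
NAND2_LEAKAGE = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 3.0, (1, 1): 9.0}

def nand(a, b):
    return 1 - (a & b)

def circuit_leakage(a, b, c):
    """Leakage of a hypothetical circuit: n1 = NAND(a, b), y = NAND(n1, c)."""
    n1 = nand(a, b)
    return NAND2_LEAKAGE[(a, b)] + NAND2_LEAKAGE[(n1, c)]

def leakage_bounds(num_inputs=3):
    """Return (best vector, min leakage, worst vector, max leakage)."""
    table = {v: circuit_leakage(*v) for v in product((0, 1), repeat=num_inputs)}
    best = min(table, key=table.get)
    worst = max(table, key=table.get)
    return best, table[best], worst, table[worst]

best_vec, min_leak, worst_vec, max_leak = leakage_bounds()
print(best_vec, min_leak, worst_vec, max_leak)
```

Applying best_vec during standby corresponds to the control technique described above; the gap between min_leak and max_leak shows how much is at stake in the choice.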
4.  Peak Power Estimation


With the high demand for reliability and performance, accurate estimation of the maximum instantaneous
power dissipation in CMOS circuits is essential for determining the IR drop on supply lines and for optimizing
the power and ground routing. Unfortunately, the problem of determining the input patterns that induce
maximum current, and hence maximum power, is NP-complete. Even for circuits with a small number
of primary inputs (PIs), an exhaustive search of the input vector space is CPU-time intensive.

In  this  research,  we  developed  an  Automatic  Test  Generation  (ATG)  based  technique  to  efficiently 
generate  tight  lower  bounds  of  the  maximum  instantaneous  power  for  CMOS  circuits  with  non-zero  gate 
delays.  Power  dissipation  due  to  spurious  transitions  has  been  considered  by  incorporating  static  timing 
analysis  into  the  estimation  process.  Experiments  were  performed  on  ISCAS  and  MCNC  benchmarks. 
Results  show  that  the  ATG-based  technique  is  superior  to  the  traditional  simulation-based  technique  in 
both speed and performance. For example, for ISCAS89 sequential benchmark circuits having over
10,000 gates, the ATG-based approach is on average 80% better and 26192% faster.
5.  Synthesis of Low-Power Logic

In this research we introduce algebraic procedures for node extraction and factorization that target low
power consumption in combinational logic circuits. A new cost function is also proposed for the
sum-of-products representation of the expressions. This cost function is used to guide the power optimization
procedures. The spatial and temporal correlations of signals were taken into account to obtain accurate
power estimates. The results show an average of 15% savings in power using logic synthesis with
the proposed accurate power estimation technique, compared to area-optimized designs.

We have also developed a transistor reordering technique to achieve low power under performance
constraints. The technique is based on signal activities at the internal nodes of logic gates. Results show
that on average a 7% improvement in power can be achieved with no or minimal area increase. Hence,
this technique virtually comes for free.
6.  High-Performance Low-Power Complex CMOS Logic Styles

In  this  research  we  consider: 

•  Logic styles which concentrate only on the extreme performance end and handle
issues imposed by scaled VTH and VDD.

•  High  performance  coupled  with  low  idle  power. 

•  Dynamic  complex  gate  logic  family  (DCSL). 

•  Quantifying  dynamic  noise  immunity. 

•  Monotonic  static  CMOS  structures:  performance  coupled  with  low  leakage  power  by  exploiting 
stacked  transistors. 

•  Power  reduction  on  long  lines  using  split  gate  structures  to  reduce  voltage  swings. 

Integrated circuits and systems have, over the years, gained increasing degrees of complexity,
performance and functionality. The core driver has been the improvement of processing technology,
which has scaled device features into the deep sub-micron range. While device
transconductance has improved, increased leakage levels due to falling threshold voltages (VTH) and lower
VDD to VTH ratios have made operation with traditional high-performance dynamic logic styles difficult. The
high active power of circuits operating at the highest speeds has led to an increased amount of power
management being applied. This in turn has led to these circuits spending an increased amount of time in
an idle state where leakage power is of importance. Logic styles which simultaneously maintain or
improve performance, are tolerant to leakage, and feature low standby power are therefore important. The logic
styles chosen allow large fan-in. This is in common with most high performance dynamic logic styles, e.g.
Domino, where a considerable fraction of the performance improvement is due to high fan-in gates
allowing shorter logic depths.

6.1  Differential  Current  Switch  Logic  (DCSL) 

DCSL  is  a  dynamic  logic  style  which  features  improved  performance  with  lowered  active  power. 
Topologically  it  is  a  differential  cascode  voltage  switch  circuit,  with  additional  transistors  to  automatically 
lock  out  inputs  once  evaluation  is  completed.  Improved  performance  is  achieved  by 

•  Allowing large fan-in gates with little impact on speed.

•  Exploiting large fan-in to reduce logic depths.

Simultaneously  lower  power  is  realized  by  restricting  the  voltage  swing  at  internal  nodes. 

Advantages of DCSL were quantified on the critical path of a 64-bit adder. Logic depth fell
from  6  to  4,  with  performance  improvement  of  26%  and  power  improvement  of  22%. 

6.2  Quantifying  Noise  Immunity  of  Gates 

A  desirable  requirement  of  high  speed  logic  gates  is  their  ability  to  implement  simple  sequential  clocked 
pipelines  -  the  clocked  sequential  design  style  being  the  dominant  design  method.  Increasing  difficulty  in 
generating  clocks  with  low  skews  has  led  to  locally  self  clocked  circuit  styles  being  employed.  Such  pulsed 
logic circuits spend a very short period of time in their evaluate stage. The evaluate time is an integer
number  of  gate  delays  (as  governed  by  the  delay  for  forcing  the  gate  to  precharge).  Traditional  static  noise 
margin  analysis  yields  a  very  distorted  picture  of  the  noise  immunity  of  such  gates.  By  considering  all 
capacitively  coupled  noise  sources,  we  show  that  a  trapezoidal  model  for  injected  noise  pulses  is  an 
adequate  method  for  estimating  the  immunity  of  such  gates  to  coupled  noise  sources.  The  analysis  is 
particularly  relevant  in  highlighting  the  importance  of  monotonic  CMOS  structures  for  high  speed  circuits. 

6.3  Stacking  Effects 

As  mentioned  earlier,  we  consider  the  reduction  of  leakage  power  to  be  important  for  high  speed 
circuits,  since  they  are  aggressively  power  managed.  The  stacking  effect  -  lower  leakage  currents  in  the 
presence  of  a  stack  of  MOS  transistors  -  is  an  effective  circuit  technique  to  reduce  leakage.  Results 
measured on an 8x8-bit carry-save multiplier show that such techniques can give 7x reductions in leakage
current over a wide range of temperatures.

6.4  Ratioed Static CMOS

Ratioed static CMOS operates conventional static CMOS in a precharge-evaluate pulsed mode. A
single transition sense is present at the gate outputs, which allows the gates to be preferentially skewed to
speed up the evaluate transition. By placing series connected MOS devices in the precharge paths and
mostly parallel devices in the evaluate paths, we achieve the following:

•  Higher  speed  due  to  reduced  capacitive  loads,  preferentially  skewed  switch  thresholds,  and  lower 
crow-bar  currents. 

•  Lower leakage power, since the circuit in its evaluate state switches off series devices in the precharge
paths, leading to lower leakage. Additionally, the noncritical nature of this path allows the use of
lower leakage devices, either through longer channel lengths, back biased wells, or elevated VTH.

6.5  Power  Reduction  in  Long  Lines  Using  Split  Gate  Structures 

While  ratioed  static  CMOS  allows  high  speed  for  cases  where  the  gate  capacitance  dominates,  it  does 
not  provide  substantial  benefits  if  the  interconnect  is  the  dominant  load.  By  reducing  the  voltage  swing  on 
long  lines  we  achieve  simultaneous  advantages  of  speed  and  power  improvement.  We  note  that  a  VTH 
drop  exists  from  gate  output  to  internal  nodes.  Splitting  gates  to  make  the  internal  nodes  drive  the  actual 
line restricts the voltage swing on the lines to VDD - VTH. The method is especially useful when the VDD to
VTH ratio is low.

6.6  Multiple  Supply  Design 

Scheduling and allocation techniques that use multiple supply voltages have been developed for DSP
datapaths. Under such a scheduling strategy, different functional blocks (such as
multipliers, adders, etc.) are allowed to run at their optimum supply voltages. Voltage converters are required
between functional blocks running at different supply voltages. Results show more than 50%
improvement in power dissipation with no degradation in performance.
7.  Dual-Vth Circuit Design


In  this  research  we  achieved  the  following: 

•  Developed  dual  threshold  circuit  design  methodology. 

•  Developed  different  dual  threshold  circuit  schemes. 

•  Developed several dual-Vth assignment algorithms to achieve the best leakage savings
under performance constraints.


In  CMOS  digital  circuits,  power  dissipation  consists  of  dynamic  and  static  components.  Since  dynamic 
power is proportional to the square of the supply voltage Vdd and static power is proportional to Vdd, lowering
supply  voltage  is  the  most  effective  way  to  reduce  power  consumption  as  long  as  dynamic  power  is 
dominant.  With  the  lowering  of  supply  voltage,  transistor  threshold  voltage  should  also  be  scaled  in  order 
to  satisfy  the  performance  requirements.  Unfortunately,  such  scaling  can  lead  to  a  dramatic  increase  in 
leakage  power  due  to  the  sub-threshold  leakage  current. 
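
The standard first-order relations behind this argument can be written as follows. The symbols are the usual textbook ones rather than notation taken from this report: α is the switching activity, C_L the switched capacitance, f the clock frequency, n the sub-threshold slope factor, and v_T = kT/q the thermal voltage.

```latex
P_{\text{dyn}} = \alpha\, C_L\, V_{dd}^{2}\, f,
\qquad
P_{\text{leak}} = I_{\text{leak}}\, V_{dd},
\qquad
I_{\text{sub}} \propto \exp\!\left(\frac{V_{gs} - V_{th}}{n\, v_T}\right)
```

The exponential dependence of the sub-threshold current on V_th is why lowering the threshold to recover speed raises leakage so sharply.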

The dual threshold technique can be used to reduce leakage power by assigning a high threshold voltage to
some transistors on non-critical paths, while critical paths use low-threshold transistors.
Therefore, both high performance and low power can be achieved simultaneously. However, due to the
complexity of a circuit, not all the transistors on non-critical paths can be assigned a high threshold voltage;
otherwise, the critical path may change, thereby increasing the critical path delay.

In order to achieve the best leakage power saving under performance constraints, we present a dual-Vth
design methodology. Different dual threshold circuit schemes are considered and several dual-Vth
assignment  algorithms  are  provided.  A  standby  leakage  model  which  has  been  verified  by  HSPICE 
simulations  is  used  to  estimate  the  standby  leakage  power  of  a  circuit. 

7.1  Dual-Vth Circuit Schemes

Different  dual-Vth  schemes  are  considered  in  our  analysis: 

•  Gate level dual-Vth circuit (DVT)

All the transistors within one gate have the same threshold voltage.

•  Mixed-Vth type 1 (MVT1)

There is no mixed Vth in p pull-up or n pull-down trees.

•  Mixed-Vth type 2 (MVT2)

Mixed Vth is allowed anywhere except for the series connected transistors (transistors in a stack).

7.2  Delay  and  Power  Estimation  Methods 

Delay information can be obtained by the following methods:

(a)  Elmore  delay  model 

(b)  Delay  look-up  table  based  on  HSPICE  simulations 

(c)  Pathmill  simulations 

Dynamic power dissipation can be simulated by a Monte Carlo based statistical method, in which
the switching at internal nodes is taken into account. An accurate leakage model which has been verified
by HSPICE simulations is used for leakage power estimation. The stack effect, short channel effect and
body effect are considered. Since leakage current depends on input signals, the average
leakage power can be evaluated with random patterns applied to primary inputs.

7.3  Algorithms  for  dual-Vth  assignment 

Three  algorithms  for  dual-Vth  assignment  have  been  developed  and  implemented: 

•  Back tracing algorithm (O(n))

•  Priority selection algorithm (O(n^2))

•  Priority-based back tracing algorithm (O(n))

The priority selection algorithm shows more leakage savings, but also takes more CPU time. The back tracing
algorithm is the fastest one, but its leakage savings are less than those of the other two algorithms. For the
priority-based back tracing algorithm, the leakage savings are close to those of the priority selection algorithm
and the run time is similar to that of the back tracing algorithm. The method to reduce leakage power using
dual-threshold-voltage transistors has been implemented in C under the Berkeley SIS environment. Results
show that there is an optimal high threshold voltage for the best leakage savings and that the dual threshold
technique is good for leakage power reduction during both standby and active modes. In addition to the
leakage power saving, the dynamic power is also reduced due to the reduction of internal node voltage
swing for high threshold gates. The effectiveness of the dual-Vth design technique depends on the circuit
structure. For some ISCAS benchmark circuits, the leakage power can be reduced by more than 80%.
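
The general flavor of gate-level (DVT) assignment under a delay constraint can be shown with a simple greedy sketch. It is not one of the three algorithms listed above, whose details are not given here. The netlist, the delay values, and the function names are hypothetical, and the timing check recomputes the longest path for every candidate gate, so this version is quadratic in the number of gates.

```python
def critical_delay(gates, order, delay):
    """Longest path delay. gates maps a gate name to its list of fanin gates."""
    arrival = {}
    for g in order:  # order must be topological (fanins before fanouts)
        arrival[g] = delay[g] + max((arrival[f] for f in gates[g]), default=0.0)
    return max(arrival.values())

def assign_high_vth(gates, order, d_low, d_high):
    """Greedily move gates to high Vth without exceeding the all-low-Vth delay."""
    delay = {g: d_low[g] for g in order}
    limit = critical_delay(gates, order, delay)
    high = set()
    for g in reversed(order):          # visit gates from outputs toward inputs
        delay[g] = d_high[g]           # tentatively slow this gate down
        if critical_delay(gates, order, delay) <= limit + 1e-12:
            high.add(g)
        else:
            delay[g] = d_low[g]        # revert: the critical path would grow
    return high

# Hypothetical netlist: g1 -> g3 -> g4 is critical, g2 -> g4 has slack.
gates = {"g1": [], "g2": [], "g3": ["g1"], "g4": ["g3", "g2"]}
order = ["g1", "g2", "g3", "g4"]
d_low = {g: 1.0 for g in order}
d_high = {g: 1.5 for g in order}

high_vth_gates = assign_high_vth(gates, order, d_low, d_high)
print(high_vth_gates)
```

Only the gate with slack is moved to the high threshold in this example, which reflects the point that not every gate on a non-critical path can be slowed without creating a new critical path.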
8.  Low Power BIST


The  salient  features  of  this  research  are  given  below: 

8.1  POWERTEST 

Due  to  the  increasing  use  of  portable  computing  and  wireless  communications  systems,  power 
consumption  is  of  major  concern  in  today's  VLSI  circuits.  With  that  in  mind  we  present  a  low  power 
weighted  random  pattern  testing  technique  for  Built-In-Self-Test  (BIST)  applications.  Power  consumption 
during  BIST  operation  can  be  minimized  while  achieving  high  fault  coverage.  Simple  measures  of 
observability  and  controllability  of  circuit  nodes  are  proposed  based  on  primary  input  signal  probability 
(probability  that  a  signal  is  logic  ONE).  Such  measures  help  determine  the  testability  of  a  circuit.  We 
developed  a  tool,  POWERTEST,  which  uses  a  genetic  algorithm  based  search  to  determine  optimal 
probability sets (signal probabilities or input signal distribution) at primary inputs to trade off test time
versus power dissipation and fault coverage. The inputs conforming to the primary input probability-activity
sets can be generated using cellular automata or an LFSR (Linear Feedback Shift Register). We
observed  that  a  single  input  distribution  (or  weights)  may  not  be  sufficient  for  some  random-pattern 
resistant  circuits,  while  multiple  distributions  consume  larger  area.  As  a  trade-off,  two  distributions  have 
been  used  in  our  analysis.  Results  on  ISCAS  benchmark  circuits  show  that  power  reduction  of  up  to 
94.86%  and  energy  reduction  of  up  to  99.93%  can  be  achieved  (compared  to  equi-probable  random- 
pattern  testing)  while  achieving  high  fault  coverage. 
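
How an LFSR can supply weighted patterns can be sketched as follows. This is a generic illustration, not the POWERTEST generator: it uses a common 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11, and obtains a signal probability near 1/4 by ANDing two state bits. The seed and function names are arbitrary choices made here.

```python
def lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def weighted_bits(seed=0xACE1, count=65535):
    """Yield bits whose probability of being ONE is close to 1/4.

    ANDing two roughly equiprobable LFSR state bits gives a weight near 0.25;
    ANDing k bits would give a weight near 2**-k.
    """
    state = seed
    for _ in range(count):
        state = lfsr16(state)
        yield (state & 1) & ((state >> 1) & 1)

ones = sum(weighted_bits())
fraction = ones / 65535
print(fraction)
```

Adjacent shift-register bits are correlated in time, so a practical generator would choose its taps and combining logic with the target activities in mind as well as the target probabilities.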

8.2  MACROTEST 

For large circuits, the GA-based algorithm is computationally expensive. We developed an alternative tool,
MACROTEST, which uses a macromodel based search to determine the optimal primary input probability
and activity (probability of switching) set (that is, the input signal distribution) that maximizes fault
coverage with low energy consumption, and which can trade off test time versus energy dissipation and fault
coverage. The inputs conforming to the primary input probability and activity set can be generated using
cellular automata or an LFSR (Linear Feedback Shift Register). Results on ISCAS benchmark circuits show
that energy reduction of up to 98.25% can be achieved (compared to equi-probable random-pattern
testing) while achieving high fault coverage. We also developed a cellular automata based test generator to
achieve low-power BIST.
9.  Low-Power VLSI Signal Processing

We realized that large improvements in power dissipation are possible at high levels of design abstraction.
This research focuses on design techniques that reduce power dissipation through reduced computational
complexity and high level synthesis.

9.1  Low-Complexity  Multiplierless  Filters 

We present a computation reduction technique which can be used to obtain multiplierless
implementations of both finite impulse response (FIR) and infinite impulse response (IIR) digital filters.

The  main  idea  is  to  remove  computational  redundancy  by  reordering  computation.  Various  approaches 
are  investigated  which  consider  normal,  differential  and  hybrid  arrangements  for  storing  coefficients.  The 
frequency  response  of  the  filter  is  unaltered.  It  is  shown  that  the  reordering  problem  can  be  formulated 
using  a  graph  in  which  vertices  represent  the  coefficients  and  edges  represent  resources  required  in  a 
computation  involving  the  coefficient  order  specified  by  vertices.  We  present  various  approaches  for 
exploiting  computational  redundancy  reduction  and  the  overheads  involved.  A  major  advantage  of  this 
methodology is that it is independent of the number representation scheme and the word-length of
coefficients. Simple polynomial run time algorithms are presented, and their power and potential are
demonstrated by presenting results for large filters (lengths exceeding 300) which show that fewer than 2 add
operations  per  coefficient  are  required.  Hence,  these  filters  can  be  used  in  low-power  and/or  high-speed 
applications  where  data  can  be  processed  in  blocks. 

The  main  idea  of  our  work  is  to  find  an  ordering  of  coefficients  which  minimizes  the  number  of  adders 
required  in  the  filter  implementation  using  a  graph  theoretic  approach.  We  employ  a  differential  coefficient 
scheme which can be implemented for any coefficient ordering in either FIR or IIR filters. Hence, the
work proposed can be viewed as a special case of the more general frame-work presented in this work.
Using the proposed differential coefficients multiplierless implementation (DCMI) scheme, one can obtain
multiplierless implementations which yield fewer than 2 adders per coefficient, as demonstrated later. The
main  contributions  of  this  work  are  summarized  below: 

•  The frequency response of the given filter is not altered.

•  The DCMI approach is independent of the number representation scheme used and of the number of
bits chosen to represent the coefficients.

•  Solutions of the DCMI problem are solutions to well-known graph theoretic problems. Efficient
polynomial time algorithms can be employed to obtain "good" solutions.

•  The frame-work presented in this work can account for more general problems which consider
memory overheads (by modifying edge costs) or fixed resources (by solving a graph partitioning
problem).

In summary, our approach can be used to obtain a unified frame-work in which low complexity and
low power block FIR filters can be obtained without compromising the frequency response characteristics
of a given optimal filter. Hence, it offers a very powerful complement to the existing methodologies for
reducing filter complexity in the domain of high-performance block filtering.

One may note that there are two ways to obtain a reduction in power dissipation using this approach.
First, we get a direct reduction in power dissipation due to the removal of redundant computation. This
advantage appears in the form of reduced switching activity because of relatively fewer computational
operations. Second, we can obtain multiplierless implementations, which are of immense interest in
high-speed signal processing applications and which can also be used to further reduce power levels by
employing voltage scaling.

Consider a linear time-invariant (LTI) FIR filter of length $M$ described by an input-output relationship
of the form

\[
y(n) = \sum_{i=0}^{M-1} c_i \, x(n-i) = \sum_{i=0}^{M-1} P_i^{(n)} \tag{1}
\]


In this context, $c_i$ represents the $i$th coefficient and $x(n-i)$ denotes the data sample at time instant
$n-i$. $P_i^{(n)}$ represents the partial product $c_i x(n-i)$ for $i = 0, \ldots, M-1$ computed at time
instant $n$. Figure 1 shows a graph $G = \{V, E\}$ representation of a 4-tap ($M = 4$) FIR filter in which
each vertex represents a coefficient and the edge $E_{i,j}$, $i, j = 0, 1, 2, 3$, represents the resources
required to multiply a data sample with the preceding vertex (i.e., coefficient $c_i$). If an array multiplier
is used to compute the products, $E_{i,j}$ represents the number of rows of adders required to implement
the multiplier and is given by the number of 1-bits in $c_i$. $M = 4$ parallel multipliers are required to
obtain a parallel implementation of the $M$-tap filter. $E_{i,j}$ depends only on the number representation
scheme and the type of multiplier employed. Further, $G$ is undirected and $E_{i,j} = E_{j,i}$ for all
$i, j = 0, 1, \ldots, M-1$.
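
As a rough illustration of how an edge cost can depend on the number representation, the sketch below counts the nonzero digits of an integer coefficient in sign-magnitude form and in a signed-power-of-two form. The function names, the integer (fixed-point) scaling of the coefficients, and the use of the non-adjacent form as the SPT encoding are assumptions made for this example; they are not taken from the report.

```python
def sm_cost(c: int) -> int:
    """Number of 1-bits in the magnitude of c (sign-magnitude form)."""
    return bin(abs(c)).count("1")


def spt_cost(c: int) -> int:
    """Number of nonzero digits of c in non-adjacent (signed-digit) form.

    Assumption: the non-adjacent form stands in for the SPT encoding.
    """
    c = abs(c)
    count = 0
    while c:
        if c & 1:
            # Choose digit +1 when c mod 4 == 1 and -1 when c mod 4 == 3,
            # so that the next bit of the remainder is guaranteed to be 0.
            digit = 2 - (c & 3)
            c -= digit
            count += 1
        c >>= 1
    return count


# Hypothetical coefficient: 7 = 4 + 2 + 1 in SM, but 8 - 1 in SPT.
cost_sm, cost_spt = sm_cost(7), spt_cost(7)
```

Under these assumptions, a coefficient with a long run of 1-bits is cheaper in the signed-digit form, which is the kind of representation dependence the edge costs capture.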


Figure 1: Graph representation of an example filter with M = 4.

Figure 2: Graph representation of an example filter with M = 8.


With the above interpretation of the graph, the output in equation 1 can be calculated by a tour along
the graph at time instant $n$. Figure 2(a) shows one such tour in $G$, which consists of the edges
$E_{i,(i+1) \bmod M}$, $i = 0, 1, \ldots, M-1$, for an $M = 8$ tap filter. The coefficients are applied such
that $c_{j+1}$ follows $c_j$, $j = 0, \ldots, M-2$. The appropriate data sample and the corresponding
coefficient are shown next to the edges. The total resources required to compute the output given by
equation 1 at time instant $n$ are given by the sum of the resources required to compute the partial
products (the $P_i^{(n)}$'s) along each edge in the tour. At the next time instant, $n+1$, each data sample
$x(i)$, $i = n, n-1, \ldots, n-M+1$, in the graph is replaced by $x(i+1)$. The outputs of the filter at time
instants $n-1$ and $n$ are given as


\[
y(n-1) = c_0 x(n-1) + c_1 x(n-2) + \cdots + c_{M-1} x(n-M)
       = P_0^{(n-1)} + P_1^{(n-1)} + \cdots + P_{M-1}^{(n-1)} \tag{2}
\]

\[
y(n) = c_0 x(n) + c_1 x(n-1) + \cdots + c_{M-1} x(n-M+1)
     = P_0^{(n)} + P_1^{(n)} + \cdots + P_{M-1}^{(n)} \tag{3}
\]


Consider the tour in figure 2(b). Suppose that this order yields differential coefficients which are
simpler to implement (e.g., they may be powers of two), and hence the implementation so obtained has
lower complexity. Note that in this example, the ordering is given by $c_0, c_4, c_5, c_1, c_2, c_6, c_7, c_3$.
The corresponding data sample $x(n-i)$ migrates from the edge $E_{i,j}$ to $E_{i,k}$, such that if $T'$ is the
new tour, $E_{i,k} \in T'$, $k \neq j$. This is shown in figure 2, which shows that $x(n-i)$ now refers to the
new edge originating at $c_i$. Next, we can calculate $P_i^{(n)}$ for $i = 0, 1, \ldots, M-1$. For simplicity
in notation, let $K = \{k_0, k_1, \ldots, k_{M-1}\}$ be the set representing the indices of coefficients in the
new ordering. Hence, for the example in figure 2(b), $K = \{0, 4, 5, 1, 2, 6, 7, 3\}$. Then, the new
differential coefficients for the order sequence in $K$ are given by $c_{k_{i+1}} - c_{k_i}$,
$i = 0, 1, \ldots, M-1$.
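
The identity that makes a differential ordering useful is $c_{k_{i+1}} x = c_{k_i} x + (c_{k_{i+1}} - c_{k_i}) x$: once one product is known, the next product along the order costs only the (hopefully simpler) differential term plus one addition. The following is a minimal Python sketch of that idea, not the report's implementation. It assumes integer coefficients, block processing (all products of one sample are formed before the outputs are assembled), and a path with $M-1$ differences plus one direct product rather than a closed tour. The coefficient values and input block are hypothetical; only the ordering is the one from the figure 2(b) example.

```python
def differential_coefficients(c, order):
    """Differences between consecutive coefficients along the given order."""
    return [c[order[i + 1]] - c[order[i]] for i in range(len(order) - 1)]


def products_along_order(c, order, x_m):
    """Return {j: c[j] * x_m}, built by accumulating differential products."""
    diffs = differential_coefficients(c, order)
    prod = {order[0]: c[order[0]] * x_m}      # the only direct product
    for i, d in enumerate(diffs):
        # c[k_{i+1}] * x = c[k_i] * x + (c[k_{i+1}] - c[k_i]) * x
        # (in hardware, d * x_m would be a few shift-and-add operations)
        prod[order[i + 1]] = prod[order[i]] + d * x_m
    return prod


def fir_block(c, order, x):
    """Block FIR output y(n) = sum_i c[i] * x(n - i), with x(n) = 0 for n < 0."""
    M = len(c)
    table = [products_along_order(c, order, xm) for xm in x]
    return [sum(table[n - i][i] for i in range(M) if n - i >= 0)
            for n in range(len(x))]


c = [3, 5, 9, 4, 7, 8, 6, 10]           # hypothetical integer coefficients
K = [0, 4, 5, 1, 2, 6, 7, 3]            # ordering from the figure 2(b) example
x = [1, -2, 4, 0, 3, 5, -1, 2, 6, -3]   # hypothetical input block

# Reference: direct evaluation of equation 1 with the original coefficients.
direct = [sum(c[i] * x[n - i] for i in range(len(c)) if n - i >= 0)
          for n in range(len(x))]
assert fir_block(c, K, x) == direct     # reordering leaves the output unchanged
```

The final check reflects the claim that reordering changes only how the products are formed, not the filter's response.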

The implementation which constructs a tour with the least number of resources (total number of adders)
can be obtained by computing the Hamiltonian path with the smallest weight in $G$. The corresponding
Hamiltonian cycle problem can be solved by employing one of the known methods for the traveling
salesman problem (TSP). Hence, the DCMI approach computes a Hamiltonian path in $G$.
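
One simple polynomial-time heuristic for finding a low-weight Hamiltonian path is a nearest-neighbour greedy walk. The sketch below shows what such a heuristic could look like; it is an illustrative assumption, not the algorithm used in this work. The edge weight is taken to be the number of 1-bits in $|c_j - c_i|$, the walk starts at the coefficient that is cheapest to realise directly, and the coefficient values are hypothetical.

```python
def ones(v: int) -> int:
    """Number of 1-bits in |v|, used here as a proxy for adder cost."""
    return bin(abs(v)).count("1")


def greedy_order(c):
    """Nearest-neighbour Hamiltonian path over the coefficient indices.

    Returns (order, total_cost); total_cost includes the cost of the
    starting coefficient, which must be realised directly.
    """
    remaining = set(range(len(c)))
    # Start from the coefficient with the fewest 1-bits (ties: lowest index).
    current = min(remaining, key=lambda i: (ones(c[i]), i))
    remaining.remove(current)
    order, total = [current], ones(c[current])
    while remaining:
        # Step to the unvisited coefficient whose difference is cheapest.
        nxt = min(remaining, key=lambda j: (ones(c[j] - c[current]), j))
        total += ones(c[nxt] - c[current])
        remaining.remove(nxt)
        order.append(nxt)
        current = nxt
    return order, total


c = [3, 5, 9, 4, 7, 8, 6, 10]              # hypothetical integer coefficients
order, cost = greedy_order(c)
direct_cost = sum(ones(v) for v in c)      # cost without reordering or differences
```

Each step scans the unvisited vertices once, so the walk takes on the order of $M^2$ cost evaluations. A greedy walk gives no optimality guarantee; comparing `cost` with `direct_cost` only indicates how much the differential ordering saves for a given coefficient set.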

Figures 3 and 4 show a relative comparison of the average number of adders per differential coefficient
obtained using the first-order DCMI solutions for sign-magnitude (SM) and signed-power-of-two (SPT)
number representations, respectively. We compare the number of adders per differential coefficient for 8,
16, and 24 bit coefficients. The example filters considered were 28-tap PM, 41-tap LS, 119-tap PM,
172-tap LS, 131-tap PM, 170-tap LS, 151-tap PM, and 217-tap LS, respectively. These results were
obtained using the greedy strategy for first-order DCMI. We note that SPT implementations require fewer
adders than SM implementations for all word-lengths. We also observe a linear relationship between the
average number of adders per differential coefficient and the word-length. Further, the average number of
adders per differential coefficient reduces, in general, as the length of the filter increases. We note that
traditional approaches to finding multiplierless implementations for word-lengths > 16 would take
enormous computational effort and may not yield good solutions. In contrast, our technique takes
polynomial time, independent of the word-length and the number representation scheme, and can be used
to obtain good DCMI solutions for large filters within a few minutes of CPU time.

Figure 3: Average Number of Adders per First-Order Differential Coefficient for Sign-Magnitude
Number Representation. (Horizontal axis: Example Filter No., 1 through 8.)

Figure 4: Average Number of Adders per First-Order Differential Coefficient for SPT Number
Representation.